Project Summary

This project analyzes customer purchasing behavior by segmenting products into two cost groups using clustering techniques. Additionally, RFM(Recency, Frequency, Monetary) analysis is conducted to classify customers based on purchasing activity and value. The study identifies high- and low-value customer segments, frequent purchase patterns, and product associations using association rule mining (Apriori Algorithm). Insights from these Analysis enable data-driven decision-making, such as personalized marketing strategies, product bundling, and inventory optimization. The findings are visualized through an interactive Shiny dashboard, providing actionable intelligence for business growth.

LOAD DATA FROM SQL SERVER

con <- dbConnect(odbc::odbc(),
                 Driver = "ODBC Driver 17 for SQL Server",
                 Server = "localhost",
                 Database = "porfolio",
                 Trusted_Connection = "Yes")

df <- dbGetQuery(con, "SELECT * FROM SalesB")

dbDisconnect(con)

df_copy<-df

EXPLORE DATA

DATA SUMMARY: The dataset consists of 522,316 transactions, including InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country. Transactions span from December 2010 to December 2011. The Quantity ranges from 1 to 500, while UnitPrice varies from 0.001 to 649.5. There are 131,420 missing CustomerIDs, indicating anonymous purchases. The dataset captures global transactions across multiple countries.

head(df)
##   InvoiceNo StockCode                     Description Quantity
## 1    536592     22489  PACK OF 12 TRADITIONAL CRAYONS        2
## 2    536592     22499       WOODEN UNION JACK BUNTING        1
## 3    536592     22501      PICNIC BASKET WICKER LARGE        2
## 4    536592     22503       CABIN BAG VINTAGE PAISLEY        1
## 5    536592     22508        DOORSTOP RETROSPOT HEART        2
## 6    536592     22530 MAGIC DRAWING SLATE DOLLY GIRL         3
##           InvoiceDate UnitPrice CustomerID        Country
## 1 2010-12-01 17:06:00      0.85         NA United Kingdom
## 2 2010-12-01 17:06:00     12.72         NA United Kingdom
## 3 2010-12-01 17:06:00     21.23         NA United Kingdom
## 4 2010-12-01 17:06:00     59.53         NA United Kingdom
## 5 2010-12-01 17:06:00      7.62         NA United Kingdom
## 6 2010-12-01 17:06:00      0.85         NA United Kingdom
summary(df)
##   InvoiceNo          StockCode         Description           Quantity      
##  Length:522316      Length:522316      Length:522316      Min.   :  1.000  
##  Class :character   Class :character   Class :character   1st Qu.:  1.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :  4.000  
##                                                           Mean   :  9.631  
##                                                           3rd Qu.: 11.000  
##                                                           Max.   :500.000  
##                                                                            
##   InvoiceDate                       UnitPrice         CustomerID    
##  Min.   :2010-12-01 08:26:00.00   Min.   :  0.001   Min.   :12347   
##  1st Qu.:2011-03-28 12:13:00.00   1st Qu.:  1.250   1st Qu.:13969   
##  Median :2011-07-20 11:55:00.00   Median :  2.080   Median :15157   
##  Mean   :2011-07-04 16:52:47.49   Mean   :  3.289   Mean   :15294   
##  3rd Qu.:2011-10-19 11:49:00.00   3rd Qu.:  4.130   3rd Qu.:16794   
##  Max.   :2011-12-09 12:50:00.00   Max.   :649.500   Max.   :18287   
##                                                     NA's   :131420  
##    Country         
##  Length:522316     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

ANALYZING NUMERIC VARIABLE (QUANTITY AND PRICE)

This step examines the distribution and relationship between Quantity and UnitPrice. The Anderson-Darling normality test results (p-value < 2.2e-16) confirm that both variables deviate significantly from a normal distribution. Boxplots highlight the presence of outliers, particularly for high UnitPrice values. A Spearman correlation of -0.41 suggests a moderate negative, non-linear relationship. Scatter plots reveal that as price increases, quantity purchased drops sharply at first, then levels off, indicating that customers are highly sensitive to price changes at lower values but less reactive at higher prices. A smoothed trend line further confirms this diminishing impact of price on quantity.

1. Normality Test

# Anderson-Darling test for Quantity
ad.test(df$Quantity)
## 
##  Anderson-Darling normality test
## 
## data:  df$Quantity
## A = 95707, p-value < 2.2e-16
# Anderson-Darling test for Quantity
ad.test(df$UnitPrice)
## 
##  Anderson-Darling normality test
## 
## data:  df$UnitPrice
## A = 50549, p-value < 2.2e-16

2. Identifying Outliers with Boxplots:

boxplot(df$UnitPrice, main = "Boxplot of UnitPrice", col = "lightblue")

boxplot(df$Quantity, main = "Boxplot of Quantity", col = "lightblue")

3. Checking Correlation Between Quantity and Unit Price:

cor(df$Quantity, df$UnitPrice, method = "spearman", use = "complete.obs") # non-linear
## [1] -0.4063089
ggplot(df, aes(x = UnitPrice, y = Quantity)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 30), ylim = c(0, 100)) +  # zoom into a specific range
  geom_smooth()+
  theme_minimal() +
  labs(title = "Zoomed-In Scatter Plot", x ="UnitPrice" , y = "Quantity")

Customer Purchase Behaviour Analysis

The analysis segmented customers using RFM (Recency, Frequency, Monetary) analysis and classified them into High Value and Low Value groups based on Customer Lifetime Value (CLV). The top 10 countries for each segment were identified and visualized. The most frequently purchased products by high and low-value customers were analyzed. A time-series analysis of purchase trends showed variations in spending patterns over time. These insights help businesses target high-value customers effectively and optimize product offerings and marketing strategies.

  1. RFM Analysis (Recency, Frequency, Monetary Value)
# Create a snapshot date (latest date in the dataset for recency calculation)
snapshot_date <- max(df$InvoiceDate)

# Calculate Recency, Frequency, and Monetary
rfm_data <- df %>%
  group_by(CustomerID) %>%
  summarise(
    Recency = as.numeric(difftime(snapshot_date, max(InvoiceDate), units = "days")),  # Days since last purchase
    Frequency = n_distinct(InvoiceNo),  # Number of purchases
    Monetary = sum(Quantity * UnitPrice)  # Total spending
  )

# Scale the RFM values
rfm_data <- rfm_data %>%
  mutate(
    Recency_Scaled = scale(Recency),
    Frequency_Scaled = scale(Frequency),
    Monetary_Scaled = scale(Monetary)
  )

# View the result
#head(rfm_data)

Segment Customers:

  • High Value: CLV above the median.
  • Low Value: CLV below the median.
# Calculate Average Order Value (AOV)
avg_order_value <- mean(rfm_data$Monetary)

# Calculate CLV by customer segment
rfm_data <- rfm_data %>%
  mutate(
    CLV = Monetary * Frequency  # A simple estimate of CLV (you can refine this model)
  )

# Classify customers as High or Low Value based on CLV
rfm_data <- rfm_data %>%
  mutate(
    CLV_Segment = ifelse(CLV > median(CLV), "High Value", "Low Value")
  )

# View the result
head(rfm_data)
## # A tibble: 6 × 9
##   CustomerID Recency Frequency Monetary Recency_Scaled[,1] Frequency_Scaled[,1]
##        <int>   <dbl>     <int>    <dbl>              <dbl>                <dbl>
## 1      12347    1.87         7   4310.              -0.902               0.110 
## 2      12348   75.0          4   1437.              -0.172              -0.0251
## 3      12349   18.1          1   1458.              -0.740              -0.161 
## 4      12350  310.           1    294.               2.17               -0.161 
## 5      12352   35.9          7   1386.              -0.562               0.110 
## 6      12353  204.           1     89.0              1.12               -0.161 
## # ℹ 3 more variables: Monetary_Scaled <dbl[,1]>, CLV <dbl>, CLV_Segment <chr>
  1. Identify Dominant Countries
  • Group customers by country and CLV segment.
  • Rank countries by customer count within each segment.
  • Select the top 10 countries for high-value and low-value customers.
  • Visualize the results using bar charts.

  1. Identify Favorite Products
  • Group products by CLV segment.
  • Sum total quantity purchased per product.
  • Rank the products based on total quantity.
  • Select and visualize the top 10 products for both high-value and low-value customers.

  1. Analyze Purchase Trends Over Time
  • Convert InvoiceDate to a proper date format.
  • Extract Year-Month from purchase dates.
  • Aggregate total purchase value per month by CLV segment.
  • Visualize purchase trends over time using a line chart.

Association Rules (Apriori or FP-Growth)

The approach involves preparing transaction data by grouping product descriptions for each invoice and filtering transactions with more than three items. The items are then split and converted into a “transactions” format for Association Rule Mining. The Apriori algorithm is applied with specified thresholds for support and confidence to uncover frequent itemsets. The left-hand side (lhs) and right-hand side (rhs) of the rules are extracted, representing commonly co-occurring products. The frequency of products is calculated and sorted to identify popular items bought together. Visualizations are created to explore these relationships, providing insights for marketing or sales strategies based on product associations.

# View the rules
inspect(head(rules))
##     lhs                                  rhs                                  support confidence   coverage      lift count
## [1] {CHILDRENS CUTLERY DOLLY GIRL }   => {CHILDRENS CUTLERY SPACEBOY }     0.01122283  0.7594502 0.01477757 41.541924   221
## [2] {CHILDRENS CUTLERY SPACEBOY }     => {CHILDRENS CUTLERY DOLLY GIRL }   0.01122283  0.6138889 0.01828154 41.541924   221
## [3] {CHILDRENS CUTLERY POLKADOT BLUE} => {CHILDRENS CUTLERY POLKADOT PINK} 0.01117205  0.7612457 0.01467601 35.438416   220
## [4] {CHILDRENS CUTLERY POLKADOT PINK} => {CHILDRENS CUTLERY POLKADOT BLUE} 0.01117205  0.5200946 0.02148080 35.438416   220
## [5] {PAINTED METAL PEARS ASSORTED}    => {ASSORTED COLOUR BIRD ORNAMENT}   0.01315255  0.7000000 0.01878936  9.493388   259
## [6] {PINK HAPPY BIRTHDAY BUNTING}     => {BLUE HAPPY BIRTHDAY BUNTING}     0.01330490  0.6517413 0.02041438 32.907922   262
##                                Product Frequency
## 1            {JUMBO BAG RED RETROSPOT}        86
## 2        {RED RETROSPOT CHARLOTTE BAG}        45
## 3             {JUMBO STORAGE BAG SUKI}        42
## 4            {LUNCH BAG RED RETROSPOT}        41
## 5            {JUMBO BAG PINK POLKADOT}        36
## 6          {CHARLOTTE BAG SUKI DESIGN}        31
## 7  {JUMBO SHOPPER VINTAGE RED PAISLEY}        30
## 8            {LUNCH BAG  BLACK SKULL.}        29
## 9             {WOODLAND CHARLOTTE BAG}        27
## 10       {CHARLOTTE BAG PINK POLKADOT}        24
library(arulesViz)

plot(rules, engine = "plotly")
plot(subrules, method = "graph",  engine = "htmlwidget")